Name: Zun Wang
Student ID: 915109847
Remember to include the relevant code from the lab page in this file so that this file will knit.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Exercise 0: Be sure that your final knitted file has good formatting. Make sure that you are using informative variable names.
data.geno <- read_csv("../input/Rice_44K_genotypes.csv.gz",
na=c("NA","00"))
## Warning: Missing column names filled in: 'X1' [1]
## Warning: Duplicated column names deduplicated: '6_17160794' =>
## '6_17160794_1' [22253]
## Parsed with column specification:
## cols(
## .default = col_character()
## )
## See spec(...) for full column specifications.
data.geno <- data.geno %>% select(-`6_17160794_1`)
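The line above drops the deduplicated copy `6_17160794_1` without comparing it to the original column. Whether the two real columns hold the same genotypes is not shown in this file. A minimal sketch of such a check, on a toy data frame whose column names and values are made up for illustration, might look like this:

```r
# Toy stand-in for the genotype table (hypothetical names and values).
# snp_b_1 mimics a column that read_csv renamed because its header appeared twice.
toy.geno <- data.frame(
  ID      = c("variety_1", "variety_2", "variety_3"),
  snp_a   = c("AA", "CC", NA),
  snp_b   = c("GG", NA, "TT"),
  snp_b_1 = c("GG", NA, "TT"),
  stringsAsFactors = FALSE
)

# identical() compares the values and the NA positions in one step.
duplicate.is.redundant <- identical(toy.geno$snp_b, toy.geno$snp_b_1)

# Drop the renamed copy only if it carries no extra information.
if (duplicate.is.redundant) {
  toy.geno <- toy.geno[, names(toy.geno) != "snp_b_1"]
}

names(toy.geno)
```

If the two columns differed, it would be worth deciding which one to keep before removing either.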
head(data.geno[,1:10]) #first six rows of first 10 columns
summary(data.geno[,1:10]) #summarizes the first 10 columns
##       X1              1_13147            1_73192            1_74969
##  Length:413         Length:413         Length:413         Length:413
##  Class :character   Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character
##    1_75852            1_75953            1_91016            1_146625
##  Length:413         Length:413         Length:413         Length:413
##  Class :character   Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character
##    1_149005           1_149754
##  Length:413         Length:413
##  Class :character   Class :character
##  Mode  :character   Mode  :character
data.geno <- data.geno %>% rename(ID=X1)
head(data.geno[,1:10])
Exercise 1: Create a data subset that contains a random sample of 10000 SNPs from the full data set. Place the smaller data set in an object called data.geno.10000. Very important: you want to keep the first column, the one with the variety IDs, and you want it to be the first column in data.geno.10000. You also do not want it to show up again at a random position later in the data set. Think about how to achieve this.
?append
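# Column 1 (ID) is placed first; sample() draws only from columns 2 onward,
# so the ID column cannot reappear among the 10000 sampled SNP columns.
data.geno.10000 <- data.geno[, append(1, sample(2:ncol(data.geno), 10000))]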
data.geno.10000
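The same idea can be checked on a small scale. The sketch below is an illustration only: `toy.geno`, its column names, the sample size of 8, and the seed value are all made up, and `set.seed()` is added so that the same columns are drawn each time the file is knit. It samples from the SNP column names rather than from column positions, then confirms that ID appears exactly once and in first place.

```r
# Arbitrary seed (an assumption, not part of the lab) so the draw is repeatable.
set.seed(42)

# Toy stand-in for data.geno: one ID column plus 20 made-up SNP columns.
toy.geno <- data.frame(ID = paste0("variety_", 1:5), stringsAsFactors = FALSE)
for (snp.index in 1:20) {
  toy.geno[[paste0("snp_", snp.index)]] <-
    sample(c("AA", "CC", NA), 5, replace = TRUE)
}

# Sample from the SNP names only; ID is not in the pool, so it cannot be drawn.
snp.names    <- setdiff(names(toy.geno), "ID")
sampled.snps <- sample(snp.names, 8)

# ID first, followed by the sampled SNP columns.
toy.geno.8 <- toy.geno[, c("ID", sampled.snps)]

# Sanity checks: ID is first, appears once, and the column count is 1 + 8.
stopifnot(names(toy.geno.8)[1] == "ID")
stopifnot(sum(names(toy.geno.8) == "ID") == 1)
stopifnot(ncol(toy.geno.8) == 9)
```

The same three checks could be run on data.geno.10000, where the expected column count would be 10001.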